Small (breaking) refactor, refactor tests, add MIT license by svwingerden · Pull Request #104 · timaeus-research/devinterp

svwingerden · 2025-06-25T09:08:42Z

(Ported over from internal changes)

Init loss now logged even if online=False

…quired refactor) (#427) * WIP * FIX gradscaler for amp * RM some prints * FIX zero_grad in SGMCMC * FIX context management for autocast * WIP example notebook * FIX gradientscaler only if use_amp, add todo * FIX use bfloat16 instead, remove gradscaler * WIP update notebook * FIX autocast also on TPU? also change dtype of loaded model * RM old code * WIP cuda fixes * RM example notebook * RFC USE_XLA to BF16 and FP16 * RFC BF16 and FP16 * FIX Optional expects an argument * RM profiling code * RFC optimizer_kwargs to sampling_method_kwargs * WIP priot kwargs refactor * WIP TrainingArguments in CustomCheckpointCallback * FIX CustomCheckpointCallback refactor * FIX SGMCMC vs SGLD incorrect distance comparison (weight decay currently broken) * FIX SGMCMC w/ wd * UNDO bfloat16 changes * FIX pythia model now not automatically loaded to cuda >:( * ADD working pythia6.9B yaml * FIX pythia test * FIX BF16 now default * CHG docs * CHG yaml files * FIX prior_kwartgs in optimizers and samplers * FIX tests * RFC Checkpointer now uses model_args * FIX loss tests * CHG devinterp typing changes * RFC BF16 now default * FIX dataclass dtype should be str * CHG prep for the big run * CHG n_ctx to 512 * FIX test formatting, dtype getattr magic * FIX NUM_GPUS for single-gpu machines * undo to 512 * UNDO pythia checkpoint is not 1024 * FIX checkpoint test * FIX checkpointer test, loss test precision * WIP refactor BF16 type, getattr back in sample loop * FIX typing * FIX yaml requires strings for unknown types? * FIX sampler dtype change * FIX typing of dtype arg * FIX memory usage handling * FIX don't copy and move model is using SPMD * FIX Jesse's minor requests * FIX formatting (sorry Jesse) * FIX is_dataclass in hparams

* gitignore updated * n_ctx sent to tokenize_dataset * hooked up n_ctx to every instance of get_datasets * last get_datasets * printing config' * using pretty printer * printing task metadata * rm nctx req * pretty printing metadata * added update n_ctx function * logging context len * right quotation * context length up top * change applied everywhere * added test * black * added n_ctx to test yaml * equals * Move update_n_ctx call before logger configuration * Remove unused parameter `n_ctx` from function * Add newline in _get_tokenize_kwargs function * fixed quantize action typo * rm commented out steps * black * suprious commit * tokenizer max length set * removed n_ctx setting * Fix logger comment formatting in llc.py * installed tpu thing * Remove 'hub/' from .gitignore * FIX bug * printing max len * FIX test * FIX loss tests now use pythia temporarily, update snapshots to match this change * FIX tokenizer tests * FIX formatting * FIX formatting even more * FIX merge issue, test * FIX workflow file now shows which tests fail * WIP CI/CD * CHG env vars for loss tests --------- Co-authored-by: Jesse Hoogland <jessequinten@gmail.com> Co-authored-by: svwingerden <stanvanwingerden@gmail.com>

* RFC grad_accum_steps -> gradient_accumulation_steps * FIX formatting * nonsense commit to test CI/CD * [Aether] Fix init loss (#483) * Account for gradient accumulation steps when computing init loss * Linting --------- Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: George <gwang24@gmail.com>

…ter (#443) * prototype pyproject usage and Makefile changes * Commit ready for PR * Use lock instead of sync * Upgraded lock file and fixed newline formatting * moved dev dependencies to dev dependency groups and updated submodule_test.yml for uv * FIX formatting * Improve UV installation and detection in Makefile * Add pre-commit install steps to Makefile * Fixing pytest execution (run aether tests from root dir, not shared/aether) --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>

…argparsing test

* Add changes (dirty branch, don't merge to main) * Add dirty changes * Add dirty changes * Stashing changes * Stashing changes * Add process group cleanup + dataloader distributed sampling * Add all-reduce on metrics * Fix all-reduce device issue * Dirty commit (not working) - halfway fix for the checkpoint loading problem * Super dirty commit - blocked on checkpoint loading problem * Fix a bunch of issues. Debug statements remain * First clean-ish commit - functioning FSDP for a fixed model set in llc.py. Dataloader splitting is still problematic - there's replication of data between GPUs. * Formatting with black * Added profiling to examine FSDP memory imbalance * Fix sampler batch size * Fixes for the layernorm problem and the dataloader interleaving problem. * Move destroy_process_group after action completion + fix OMP worker overload bug * Add gradient checkpointing * Add FSDP/non-FSDP comparison test * Bugfix for non-FSDP gpu smampling * Add snapshot tests * Fix fsdp pytest * Black formatting * Add seeding + rmsprop tests for FSDP * Simplify fsdp test + add syrupy snapshot * Format tests with Black * ADD TODOs for Stan * FIX SPMD multi chain, address some old will comments * update snapshot * FIX snapshot for fsdp * FIX tests I broke accidentally * FIX snapshot test full precision * ok now the tests should really pass w/ full precision --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>

[CI/CD] Add GPU test action

…tests-shared-fixtures [Aether] Einar/eng 178 refactor tests shared fixtures

* ADD TPU test * CHG TPU config * ADD misc * Readd ssh key * ADD smoke test * ADD import torch_xla * FIX tpu dependencies * FIX dep for gpu? * Readd torch_xla explicitly * Actually fix * Maybe this is better? * Ready now? * Don't restrict tests * Shrink tpu test * ADD server port for gpu * FIX test? * FIX GPU ssh connectivity test * ADD TPU_TYPE * FIX TPU_TYPE passing * CHG secrets. -> env. * CHG env. -> vars. * UPD test * ADD cln up comparison * UPD test? * More relaxed tests * Try second round of tests on tpu * FIX aether tests * UPDATE sync * ADD skips if missing file * FIX tests? * FIX tests for real * Consolidate tests into one * Final fix? * Skip breaking test * Skip breaking test * Skip breaking test * FIX gpu? * Skip TPU test if WIP * NVM * Actually skip * Skip TPU if WIP * CHG which test is being run * Rerun * Comment out failing test * Trust o3 * Wrong indenting * RM find-links * Skip 31m * Reward-hacked my way out of this * FIX checkpoint naming * Update tpu_test.yml * Update gpu_test.yml * Update tpu_test.yml * FIX tests for CI/CD * CHG verbose now false in test_tpus, check for SPMD in gpu tests * FIX formatting * FIX failing data tests (and bypass one) --------- Co-authored-by: svwingerden <stanvanwingerden@gmail.com>

* make * imports explicit * move to snapshots for sampler_accuracy_test * try up validation chains & draws * Remove normal-crossing tests, as they are pretty inconsistent and don't converge well. * Refactor rrr_test, including caching model training and using snapshots * Format with black * Run snapshot test first every time, for rng seeding consistency. * fix mismatch between snapshot/non-snapshot * Comment out sampler_ordinality test for now, to debug CI faster * Refactor rllc_test into fast/slow paradigm * Seed the dataset generator, and re-snapshot * Make sure dataloaders are deterministic in conftest. * Fix rllc_test failures by commenting out powers = [1, 1] test. * Refactor ordinality test into snapshot format. * Reformat ordinality test. * Format conftest * Remove burnin steps comment * Add architecture check, and skip some tests if not x86_64.

* Add fix for pyright not finding installed packages. * Fix relative path in timaeus_cli pyproject.toml --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>

* WIP use self-hosted runners for speedup? * WIP * remove ssh for ci/cd * FIX GPU cicd * ADD concurrency to all git ci/cd jobs * concurrency check * devinterp tests don't need our cool machines * CHG caching for cicd on gpu * skip two failing ci/cd tests * CHG cicd caching * CHG pytest skip markings * FIX hessian test * FIX failing CI/CD * CHG CI/CD to use correct DISK_PATH * CHG -n 2 to prevent DDoS * CHG workflow yamls caching, n back to auto * FIX formatting

* CHG use ruff isntead of black * FIX CI/CD * CHG .vscode settings and suggested extensions * delete the empty ipynb files * Update Makefile Co-authored-by: William Snell <59493198+williamsnell@users.noreply.github.com> * FIX aiohttp from merge conflict * CHG don't lint on commit, don't lint projects folder * don't lint/ format projects folder * FIX formatting ;) * WIP pre-commit, CI/CD, config, makefile * WIP add linting check (that will fail) * CHG disable linting autofix * FIX linting & formatting * WIP CI/CD * FIX makefile setup old precommit hooks, attempt 2 * WIP CICD * WIP CICD * FIX CICD * slow failing mala tests --------- Co-authored-by: William Snell <59493198+williamsnell@users.noreply.github.com>

svwingerden and others added 17 commits June 25, 2025 10:57

hotfix formatting

fa792d7

[Devinterp] FIX logging of init loss (#277)

73c1ab5

Init loss now logged even if online=False

HOTFIX lockfile w/ syrupy, remove syrupy import, fix argparsing, add …

1ed4a8a

…argparsing test

Merge pull request #566 from timaeus-research/jesse/eng-386-cicd-on-gpu

607bbe6

[CI/CD] Add GPU test action

Merge pull request #606 from timaeus-research/einar/eng-178-refactor-…

41db00c

…tests-shared-fixtures [Aether] Einar/eng 178 refactor tests shared fixtures

Add fix for pyright not finding installed packages. (#748)

0b315ca

* Add fix for pyright not finding installed packages. * Fix relative path in timaeus_cli pyproject.toml --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>

Merge branch 'main' into georgeyw/tetrahedron-seeds

d59724d

Merge branch 'main' into williamsnell/eng-510

9eba2d8

svwingerden temporarily deployed to testpypi June 25, 2025 09:09 — with GitHub Actions Inactive

svwingerden merged commit dc98dfb into main Jun 25, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small (breaking) refactor, refactor tests, add MIT license#104

Small (breaking) refactor, refactor tests, add MIT license#104
svwingerden merged 17 commits intomainfrom
rfc-for-monorepo-license-and-update

svwingerden commented Jun 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

svwingerden commented Jun 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants